Appears in the proceedings of ACL-EACL '97 

Similarity-Based Methods For Word Sense Disambiguation 






Ido Dagan 

Dept. of Mathematics and 
Computer Science 
Bar Ilan University 
Ramat Gan 52900, Israel 
dagan@macs.biu.ac.il



Lillian Lee

Div. of Engineering and
Applied Sciences
Harvard University
Cambridge, MA 01238, USA
llee@eecs.harvard.edu



Fernando Pereira

AT&T Labs - Research
600 Mountain Ave.
Murray Hill, NJ 07974, USA
pereira@research.att.com



Abstract 

We compare four similarity-based esti- 
mation methods against back-off and 
maximum-likelihood estimation methods 
on a pseudo-word sense disambiguation 
task in which we controlled for both 
unigram and bigram frequency. The 
similarity-based methods perform up to 
40% better on this particular task. We also 
conclude that events that occur only once 
in the training set have major impact on 
similarity-based estimates. 



1 Introduction 

The problem of data sparseness affects all statistical 
methods for natural language processing. Even large 
training sets tend to misrepresent low-probability 
events, since rare events may not appear in the train- 
ing corpus at all. 

We concentrate here on the problem of estimat- 
ing the probability of unseen word pairs, that is, 
pairs that do not occur in the training set. Katz's 
back-off scheme (Katz, 1987), widely used in bigram 
language modeling, estimates the probability of an 
unseen bigram by utilizing unigram estimates. This 
has the undesirable result of assigning unseen bi- 
grams the same probability if they are made up of 
unigrams of the same frequency. 



Class-based methods (Brown et al., 1992; Pereira,
Tishby, and Lee, 1993; Resnik, 1992) cluster words
into classes of similar words, so that one can base 
the estimate of a word pair's probability on the aver- 
aged cooccurrence probability of the classes to which 
the two words belong. However, a word is therefore 
modeled by the average behavior of many words, 
which may cause the given word's idiosyncrasies to 
be ignored. For instance, the word "red" might well 
act like a generic color word in most cases, but it
has distinctive cooccurrence patterns with respect
to words like "apple," "banana," and so on. 

We therefore consider similarity-based estimation 
schemes that do not require building general word 
classes. Instead, estimates for the most similar 
words to a word w are combined; the evidence pro- 
vided by word w' is weighted by a function of its 
similarity to w. Dagan, Markus, and Markovitch 
(1993) propose such a scheme for predicting which 
unseen cooccurrences are more likely than others. 
However, their scheme does not assign probabilities. 
In what follows, we focus on probabilistic similarity- 
based estimation methods. 

We compared several such methods, including
that of Dagan, Pereira, and Lee (1994) and the
cooccurrence smoothing method of Essen and Steinbiss
(1992), against classical estimation methods, including
that of Katz, in a decision task involving unseen
pairs of direct objects and verbs, where unigram
frequency was eliminated from being a factor.
We found that all the similarity-based schemes
performed almost 40% better than back-off, which is
expected to yield about 50% accuracy in our
experimental setting. Furthermore, a scheme based
on the total divergence of empirical distributions
to their average^1 yielded statistically significant
improvement in error rate over cooccurrence smoothing.

We also investigated the effect of removing
extremely low-frequency events from the training set.
We found that, in contrast to back-off smoothing,
where such events are often discarded from training
with little discernible effect, similarity-based
smoothing methods suffer noticeable performance
degradation when singletons (events that occur
exactly once) are omitted.

^1 To the best of our knowledge, this is the first use
of this particular distribution dissimilarity function in
statistical language processing. The function itself is
implicit in earlier work on distributional clustering (Pereira,
Tishby, and Lee, 1993), has been used by Tishby (p.c.)
in other distributional similarity work, and, as
suggested by Yoav Freund (p.c.), it is related to results of
Hoeffding (1965) on the probability that a given sample
was drawn from a given joint distribution.

2 Distributional Similarity Models 

We wish to model conditional probability distributions
arising from the cooccurrence of linguistic objects,
typically words, in certain configurations. We
thus consider pairs $(w_1, w_2) \in V_1 \times V_2$ for appropriate
sets $V_1$ and $V_2$, not necessarily disjoint. In what
follows, we use subscript $i$ for the $i$th element of a
pair; thus $P(w_2|w_1)$ is the conditional probability (or
rather, some empirical estimate, the true probability
being unknown) that a pair has second element $w_2$
given that its first element is $w_1$; and $P(w_1|w_2)$
denotes the probability estimate, according to the base
language model, that $w_1$ is the first word of a pair
given that the second word is $w_2$. $P(w)$ denotes the
base estimate for the unigram probability of word
$w$.

A similarity-based language model consists of 
three parts: a scheme for deciding which word pairs 
require a similarity-based estimate, a method for 
combining information from similar words, and, of 
course, a function measuring the similarity between 
words. We give the details of each of these three 
parts in the following three sections. We will only
be concerned with similarity between words in $V_1$.

2.1 Discounting and Redistribution 

Data sparseness makes the maximum likelihood es- 
timate (MLE) for word pair probabilities unreliable. 
The MLE for the probability of a word pair $(w_1, w_2)$,
conditional on the appearance of word $w_1$, is simply

$$P_{ML}(w_2|w_1) = \frac{c(w_1, w_2)}{c(w_1)}, \qquad (1)$$

where $c(w_1, w_2)$ is the frequency of $(w_1, w_2)$ in the
training corpus and $c(w_1)$ is the frequency of $w_1$.
However, $P_{ML}$ is zero for any unseen word pair,
which leads to extremely inaccurate estimates for
word pair probabilities.
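
As a concrete illustration of the maximum-likelihood estimate, the following minimal Python sketch computes it from raw pair counts. The toy (noun, verb) pairs and the function name `p_ml` are invented for this example and are not taken from our experiments.

```python
from collections import Counter

# Toy (w1, w2) = (noun, verb) training pairs; invented for illustration only.
pairs = [("plans", "make"), ("plans", "make"), ("plans", "change"), ("action", "take")]

pair_counts = Counter(pairs)                  # c(w1, w2)
w1_counts = Counter(w1 for w1, _ in pairs)    # c(w1)

def p_ml(w2, w1):
    """Maximum-likelihood estimate P_ML(w2 | w1) = c(w1, w2) / c(w1)."""
    if w1_counts[w1] == 0:
        raise ValueError(f"{w1!r} never occurs as a first element")
    return pair_counts[(w1, w2)] / w1_counts[w1]

seen = p_ml("make", "plans")     # 2 of the 3 pairs whose first element is "plans"
unseen = p_ml("take", "plans")   # this pair was never observed
```

Here `unseen` is exactly zero even though "take plans" is perfectly plausible, which is the sparseness problem that the rest of this section addresses.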



Previous proposals for dealing with this problem (Good, 1953;
Jelinek, Mercer, and Roukos, 1992; Katz, 1987;
Church and Gale, 1991) adjust the MLE
so that the total probability of seen word pairs
is less than one, leaving some probability mass to
be redistributed among the unseen pairs. In general,
the adjustment involves either interpolation, in
which the MLE is used in linear combination with
an estimator guaranteed to be nonzero for unseen
word pairs, or discounting, in which a reduced MLE
is used for seen word pairs, with the probability mass
left over from this reduction used to model unseen
pairs.

The discounting approach is the one adopted by
Katz (1987):

$$\hat{P}(w_2|w_1) = \begin{cases} P_d(w_2|w_1) & \text{if } c(w_1, w_2) > 0 \\ \alpha(w_1) P_r(w_2|w_1) & \text{otherwise} \end{cases} \qquad (2)$$

where $P_d$ represents the Good-Turing discounted
estimate (Katz, 1987) for seen word pairs, and $P_r$
denotes the model for probability redistribution among
the unseen word pairs. $\alpha(w_1)$ is a normalization
factor.
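
The shape of the back-off estimate can be sketched in a few lines of Python. This is only a sketch under loud assumptions: a constant multiplicative `DISCOUNT` stands in for the Good-Turing discount, the vocabulary `V2` and the counts are invented, and the redistribution model is passed in as a function `p_r` (here a hypothetical uniform one).

```python
from collections import Counter

# Invented toy data; V2 is the assumed vocabulary of second elements.
pairs = [("plans", "make"), ("plans", "make"), ("plans", "change"), ("action", "take")]
V2 = ["make", "take", "change", "announce"]

pair_counts = Counter(pairs)
w1_counts = Counter(w1 for w1, _ in pairs)
DISCOUNT = 0.75  # ASSUMPTION: constant factor standing in for Good-Turing discounting

def p_d(w2, w1):
    """Discounted estimate for a seen pair (stand-in for the Good-Turing estimate)."""
    return DISCOUNT * pair_counts[(w1, w2)] / w1_counts[w1]

def backoff(w2, w1, p_r):
    """Back-off estimate: P_d for seen pairs, alpha(w1) * P_r for unseen pairs."""
    if pair_counts[(w1, w2)] > 0:
        return p_d(w2, w1)
    seen = [v for v in V2 if pair_counts[(w1, v)] > 0]
    unseen = [v for v in V2 if pair_counts[(w1, v)] == 0]
    left_over = 1.0 - sum(p_d(v, w1) for v in seen)      # mass freed by discounting
    alpha = left_over / sum(p_r(v, w1) for v in unseen)  # normalization factor
    return alpha * p_r(w2, w1)

def uniform_p_r(w2, w1):
    """Hypothetical redistribution model: uniform over V2."""
    return 1.0 / len(V2)

total = sum(backoff(v, "plans", uniform_p_r) for v in V2)
```

With these numbers the seen pairs keep 0.75 of the mass and the two unseen verbs share the remaining 0.25, so `total` comes out to one, which is the role of the normalization factor.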



Following Dagan, Pereira, and Lee (1994), we
modify Katz's formulation by writing $P_r(w_2|w_1)$
instead of $P(w_2)$, enabling us to use similarity-based
estimates for unseen word pairs instead of basing the
estimate for the pair on unigram frequency $P(w_2)$.
Observe that similarity estimates are used for unseen
word pairs only.

We next investigate estimates for $P_r(w_2|w_1)$
derived by averaging information from words that are
distributionally similar to $w_1$.

2.2 Combining Evidence 

Similarity-based models assume that if word $w_1'$ is
"similar" to word $w_1$, then $w_1'$ can yield information
about the probability of unseen word pairs involving
$w_1$. We use a weighted average of the evidence
provided by similar words, where the weight given to a
particular word $w_1'$ depends on its similarity to $w_1$.

More precisely, let $W(w_1, w_1')$ denote an increasing
function of the similarity between $w_1$ and $w_1'$,
and let $S(w_1)$ denote the set of words most similar
to $w_1$. Then the general form of similarity model
we consider is a $W$-weighted linear combination of
predictions of similar words:

$$P_{SIM}(w_2|w_1) = \sum_{w_1' \in S(w_1)} \frac{W(w_1, w_1')}{N(w_1)} P(w_2|w_1'), \qquad (3)$$

where $N(w_1) = \sum_{w_1' \in S(w_1)} W(w_1, w_1')$ is a
normalization factor. According to this formula, $w_2$ is more
likely to occur with $w_1$ if it tends to occur with the
words that are most similar to $w_1$.
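
The weighted combination of equation (3) can be sketched as follows. The base model, the neighbor set, and the weights below are invented placeholders; any of the similarity functions of Section 2.3 could supply the weights.

```python
def p_sim(w2, w1, neighbors, weight, p_cond):
    """Equation (3): W-weighted average of P(w2 | w1') over w1' in S(w1)."""
    norm = sum(weight(w1, n) for n in neighbors)   # N(w1)
    return sum(weight(w1, n) * p_cond(w2, n) for n in neighbors) / norm

# Hypothetical base model P(w2 | w1') stored as nested dictionaries.
base = {
    "plan": {"make": 0.6, "change": 0.4},
    "scheme": {"make": 0.5, "hatch": 0.5},
    "banana": {"eat": 1.0},
}
# Hypothetical weights W("proposal", w1'); a larger value means more similar.
weights = {"plan": 3.0, "scheme": 1.0, "banana": 0.0}

def p_cond(w2, w1):
    return base[w1].get(w2, 0.0)

def weight(w1, neighbor):
    return weights[neighbor]

# Estimate for the unseen pair ("proposal", "make"): (3*0.6 + 1*0.5 + 0*0) / 4
estimate = p_sim("make", "proposal", list(base), weight, p_cond)
```

Because the weights are normalized by $N(w_1)$, only their relative sizes matter; a neighbor with weight zero, such as "banana" here, contributes nothing.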

Considerable latitude is allowed in defining the
set $S(w_1)$, as is evidenced by previous work that
can be put in the above form. Essen and Steinbiss
(1992) and Karov and Edelman (1996) (implicitly)
set $S(w_1) = V_1$. However, it may be desirable to
restrict $S(w_1)$ in some fashion, especially if $V_1$ is
large. For instance, Dagan, Pereira, and Lee (1994)
use the closest $k$ or fewer words $w_1'$ such that the
dissimilarity between $w_1$ and $w_1'$ is less than a threshold
value $t$; $k$ and $t$ are tuned experimentally.

Now, we could directly replace $P_r(w_2|w_1)$ in the
back-off equation (2) with $P_{SIM}(w_2|w_1)$. However,
other variations are possible, such as interpolating
with the unigram probability $P(w_2)$:

$$P_r(w_2|w_1) = \gamma P(w_2) + (1 - \gamma) P_{SIM}(w_2|w_1),$$

where $\gamma$ is determined experimentally (Dagan,
Pereira, and Lee, 1994). This represents, in effect,
a linear combination of the similarity estimate and
the back-off estimate: if $\gamma = 1$, then we have exactly
Katz's back-off scheme. As we focus in this paper
on alternatives for $P_{SIM}$, we will not consider this
approach here; that is, for the rest of this paper,
$P_r(w_2|w_1) = P_{SIM}(w_2|w_1)$.

2.3 Measures of Similarity

We now consider several word similarity functions
that can be derived automatically from the statistics
of a training corpus, as opposed to functions derived
from manually-constructed word classes (Resnik,
1992). All the similarity functions we describe below
depend just on the base language model $P(\cdot|\cdot)$, not
the discounted model $\hat{P}(\cdot|\cdot)$ from Section 2.1 above.

2.3.1 KL divergence

Kullback-Leibler (KL) divergence is a standard
information-theoretic measure of the dissimilarity
between two probability mass functions (Cover and
Thomas, 1991). We can apply it to the conditional
distribution $P(\cdot|w_1)$ induced by $w_1$ on words in $V_2$:

$$D(w_1 \| w_1') = \sum_{w_2} P(w_2|w_1) \log \frac{P(w_2|w_1)}{P(w_2|w_1')}. \qquad (4)$$

For $D(w_1 \| w_1')$ to be defined it must be the case
that $P(w_2|w_1') > 0$ whenever $P(w_2|w_1) > 0$.
Unfortunately, this will not in general be the case for
MLEs based on samples, so we would need smoothed
estimates of $P(w_2|w_1')$ that redistribute some
probability mass to zero-frequency events. However, using
smoothed estimates for $P(w_2|w_1)$ as well requires a
sum over all $w_2 \in V_2$, which is expensive for the
large vocabularies under consideration. Given the
smoothed denominator distribution, we set

$$W(w_1, w_1') = 10^{-\beta D(w_1 \| w_1')},$$

where $\beta$ is a free parameter.

2.3.2 Total divergence to the average

A related measure is based on the total KL divergence
to the average of the two distributions:

$$A(w_1, w_1') = D\left(w_1 \,\Big\|\, \frac{w_1 + w_1'}{2}\right) + D\left(w_1' \,\Big\|\, \frac{w_1 + w_1'}{2}\right), \qquad (5)$$

where $(w_1 + w_1')/2$ is shorthand for the distribution

$$\frac{1}{2}\left(P(\cdot|w_1) + P(\cdot|w_1')\right).$$

Since $D(\cdot\|\cdot) \geq 0$, $A(w_1, w_1') \geq 0$. Furthermore,
letting $p(w_2) = P(w_2|w_1)$, $p'(w_2) = P(w_2|w_1')$ and
$C = \{w_2 : p(w_2) > 0, p'(w_2) > 0\}$, it is straightforward
to show by grouping terms appropriately that

$$A(w_1, w_1') = \sum_{w_2 \in C} \left\{ H(p(w_2) + p'(w_2)) - H(p(w_2)) - H(p'(w_2)) \right\} + 2 \log 2,$$

where $H(x) = -x \log x$. Therefore, $A(w_1, w_1')$
is bounded, ranging between 0 and $2 \log 2$, and
smoothed estimates are not required because probability
ratios are not involved. In addition, the calculation
of $A(w_1, w_1')$ requires summing only over those
$w_2$ for which $P(w_2|w_1)$ and $P(w_2|w_1')$ are both
nonzero, which, for sparse data, makes the computation
quite fast.

As in the KL divergence case, we set $W(w_1, w_1')$
to be $10^{-\beta A(w_1, w_1')}$.
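
To make the two expressions for the total divergence to the average concrete, the following Python sketch computes $A$ both from its definition as a sum of two KL divergences and from the form that sums only over the common words, and checks the boundary case of disjoint supports. The distributions are invented, and the function names are ours for illustration.

```python
import math

def kl(p, q):
    """D(p || q); assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def total_divergence(p, q):
    """A(p, q) computed directly from its definition (two KL terms)."""
    avg = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in set(p) | set(q)}
    return kl(p, avg) + kl(q, avg)

def h(x):
    """H(x) = -x log x, as defined in the text."""
    return -x * math.log(x)

def total_divergence_common(p, q):
    """The same quantity, summing only over words where both p and q are positive."""
    common = [x for x in p if p[x] > 0 and q.get(x, 0.0) > 0]
    return sum(h(p[x] + q[x]) - h(p[x]) - h(q[x]) for x in common) + 2 * math.log(2)

def weight(divergence, beta):
    """W = 10^(-beta * A); beta is the free parameter."""
    return 10.0 ** (-beta * divergence)

# Invented conditional distributions P(. | w1) and P(. | w1').
p = {"make": 0.5, "take": 0.5}
q = {"make": 0.25, "eat": 0.75}

a_direct = total_divergence(p, q)
a_common = total_divergence_common(p, q)
a_disjoint = total_divergence({"make": 1.0}, {"eat": 1.0})  # no shared words
```

The two computations agree up to rounding, and the disjoint case attains the upper bound $2 \log 2$. Note that the averaged distribution is positive wherever either argument is, so no smoothing is needed for the KL terms.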

2.3.3 L1 norm

The $L_1$ norm is defined as

$$L(w_1, w_1') = \sum_{w_2} \left| P(w_2|w_1) - P(w_2|w_1') \right|. \qquad (6)$$

By grouping terms as before, we can express
$L(w_1, w_1')$ in a form depending only on the "common"
$w_2$:

$$L(w_1, w_1') = 2 - \sum_{w_2 \in C} p(w_2) - \sum_{w_2 \in C} p'(w_2) + \sum_{w_2 \in C} \left| p(w_2) - p'(w_2) \right|.$$

This last form makes it clear that $0 \leq L(w_1, w_1') \leq 2$,
with $L(w_1, w_1') = 2$ if and only if there are no words $w_2$
such that both $P(w_2|w_1)$ and $P(w_2|w_1')$ are strictly
positive.

Since we require a weighting scheme that is
decreasing in $L$, we set

$$W(w_1, w_1') = (2 - L(w_1, w_1'))^{\beta},$$

with $\beta$ again free.
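
The following Python sketch evaluates the $L_1$ norm both over the full union of supports and in the form restricted to the common words, and then applies the weighting $(2 - L)^{\beta}$. The distributions and the value of $\beta$ are invented, and the clamp at zero is a defensive assumption of ours against floating-point rounding, not part of the definition.

```python
def l1(p, q):
    """L1 distance summed over every word in either distribution."""
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in set(p) | set(q))

def l1_common(p, q):
    """Same distance using only words where both distributions are positive.

    Valid when p and q each sum to one.
    """
    common = [x for x in p if p[x] > 0 and q.get(x, 0.0) > 0]
    return (2.0
            - sum(p[x] for x in common)
            - sum(q[x] for x in common)
            + sum(abs(p[x] - q[x]) for x in common))

def l1_weight(p, q, beta):
    """W = (2 - L)^beta, clamped at 0 so rounding cannot make the base negative."""
    return max(0.0, 2.0 - l1(p, q)) ** beta

# Invented conditional distributions P(. | w1) and P(. | w1').
p = {"make": 0.5, "take": 0.5}
q = {"make": 0.25, "eat": 0.75}

d_full = l1(p, q)                     # 0.25 + 0.5 + 0.75
d_common = l1_common(p, q)            # 2 - 0.5 - 0.25 + 0.25
w_example = l1_weight(p, q, 4.0)      # (2 - 1.5) ** 4
w_disjoint = l1_weight({"make": 1.0}, {"eat": 1.0}, 4.0)
```

Both forms give 1.5 for this pair, and two distributions with no common word receive weight zero.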



2.3.4 Confusion probability 



Essen and Steinbiss (1992) introduced confusion
probability,^2 which estimates the probability that
word $w_1'$ can be substituted for word $w_1$:

$$P_C(w_1'|w_1) = W(w_1, w_1') = \sum_{w_2} \frac{P(w_1|w_2) P(w_1'|w_2) P(w_2)}{P(w_1)}.$$

Unlike the measures described above, $w_1$ may not
necessarily be the "closest" word to itself, that is,
there may exist a word $w_1'$ such that $P_C(w_1'|w_1) >
P_C(w_1|w_1)$.

The confusion probability can be computed from
empirical estimates provided all unigram estimates
are nonzero (as we assume throughout). In fact, the
use of smoothed estimates like those of Katz's back-off
scheme is problematic, because those estimates
typically do not preserve consistency with respect
to marginal estimates and Bayes's rule. However,
using consistent estimates (such as the MLE), we
can rewrite $P_C$ as follows:

$$P_C(w_1'|w_1) = \sum_{w_2} \frac{P(w_2|w_1)}{P(w_2)} \cdot P(w_2|w_1') P(w_1').$$

This form reveals another important difference
between the confusion probability and the functions
$D$, $A$, and $L$ described in the previous sections.
Those functions rate $w_1'$ as similar to $w_1$ if, roughly,
$P(w_2|w_1')$ is high when $P(w_2|w_1)$ is. $P_C(w_1'|w_1)$,
however, is greater for those $w_1'$ for which $P(w_1', w_2)$
is large when $P(w_2|w_1) / P(w_2)$ is. When the ratio
$P(w_2|w_1) / P(w_2)$ is large, we may think of $w_2$ as
being exceptional, since if $w_2$ is infrequent, we do not
expect $P(w_2|w_1)$ to be large.
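
As a check on the two expressions for the confusion probability, the following Python sketch computes both from maximum-likelihood estimates on an invented set of pairs. All names and data are hypothetical; the point is only that the two forms coincide when the estimates are consistent with Bayes's rule.

```python
from collections import Counter

# Invented (w1, w2) pairs; MLE estimates derived from them are Bayes-consistent.
pairs = [("plan", "make"), ("plan", "make"), ("plan", "change"),
         ("scheme", "make"), ("scheme", "hatch"), ("banana", "eat")]
n = len(pairs)
joint = Counter(pairs)
c1 = Counter(w1 for w1, _ in pairs)
c2 = Counter(w2 for _, w2 in pairs)

def p1(w1):
    return c1[w1] / n                      # P(w1)

def p2(w2):
    return c2[w2] / n                      # P(w2)

def p_w2_given_w1(w2, w1):
    return joint[(w1, w2)] / c1[w1]        # P(w2 | w1)

def p_w1_given_w2(w1, w2):
    return joint[(w1, w2)] / c2[w2]        # P(w1 | w2)

def confusion(w1_alt, w1):
    """P_C(w1' | w1) = sum_w2 P(w1|w2) P(w1'|w2) P(w2) / P(w1)."""
    return sum(p_w1_given_w2(w1, w2) * p_w1_given_w2(w1_alt, w2) * p2(w2)
               for w2 in c2) / p1(w1)

def confusion_rewritten(w1_alt, w1):
    """Equivalent form: sum_w2 [P(w2|w1) / P(w2)] * P(w2|w1') P(w1')."""
    return sum(p_w2_given_w1(w2, w1) / p2(w2) * p_w2_given_w1(w2, w1_alt) * p1(w1_alt)
               for w2 in c2)

pc = confusion("scheme", "plan")
pc_rewritten = confusion_rewritten("scheme", "plan")
total = sum(confusion(w, "plan") for w in c1)   # P_C(. | w1) sums to one
```

On this data both forms give 2/9 for substituting "scheme" for "plan", and the confusion probabilities over all candidate $w_1'$ sum to one.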

2.3.5 Summary 

Several features of the measures of similarity 
listed above are summarized in Table 1. "Base
LM constraints" are conditions that must be sat- 
isfied by the probability estimates of the base lan- 
guage model. The last column indicates whether 
the weight W{wi,w'i) associated with each similar- 
ity function depends on a parameter that needs to 
be tuned experimentally. 

3 Experimental Results 

We evaluated the similarity measures listed above
on a word sense disambiguation task, in which each
method is presented with a noun and two verbs, and
decides which verb is more likely to have the noun
as a direct object. Thus, we do not measure the
absolute quality of the assignment of probabilities,
as would be the case in a perplexity evaluation, but
rather the relative quality. We are therefore able to
ignore constant factors, and so we neither normalize
the similarity measures nor calculate the denominator
in equation (3).

^2 Actually, they present two alternative definitions.
We use their model 2-B, which they found yielded the
best experimental results.

3.1 Task: Pseudo-word Sense 
Disambiguation 

In the usual word sense disambiguation problem, 
the method to be tested is presented with an am- 
biguous word in some context, and is asked to iden- 
tify the correct sense of the word from the context. 
For example, a test instance might be the sentence 
fragment "robbed the bank"; the disambiguation
method must decide whether "bank" refers to a river 
bank, a savings bank, or perhaps some other alter- 
native. 

While sense disambiguation is clearly an impor- 
tant task, it presents numerous experimental dif- 
ficulties. First, the very notion of "sense" is not 
clearly defined; for instance, dictionaries may pro- 
vide sense distinctions that are too fine or too coarse 
for the data at hand. Also, one needs to have train- 
ing data for which the correct senses have been as- 
signed, which can require considerable human effort. 

To circumvent these and other difficulties, we
set up a pseudo-word disambiguation experiment
(Schütze, 1992; Gale, Church, and Yarowsky, 1992),
the general format of which is as follows. We first
construct a list of pseudo-words, each of which is
the combination of two different words in $V_2$. Each
word in $V_2$ contributes to exactly one pseudo-word.
Then, we replace each $w_2$ in the test set with its
corresponding pseudo-word. For example, if we choose
to create a pseudo-word out of the words "make"
and "take", we would change the test data like this:

make plans ⇒ {make, take} plans
take action ⇒ {make, take} action

The method being tested must choose between the
two words that make up the pseudo-word.
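
The construction can be sketched in Python as follows. How words are paired is an assumption of this sketch (consecutive entries of a supplied list are paired); the function names and the test pairs are likewise invented for illustration.

```python
def make_pseudo_words(words):
    """Pair consecutive words so that each word belongs to exactly one pseudo-word.

    ASSUMPTION: the caller has already ordered `words` so that neighbors should be
    paired; an odd-length list is rejected.
    """
    if len(words) % 2 != 0:
        raise ValueError("need an even number of words to form pairs")
    mapping = {}
    for a, b in zip(words[0::2], words[1::2]):
        mapping[a] = mapping[b] = (a, b)
    return mapping

def rewrite_test_set(test_pairs, mapping):
    """Replace each w2 by its pseudo-word, keeping the original w2 as the answer."""
    return [(w1, mapping[w2], w2) for w1, w2 in test_pairs]

mapping = make_pseudo_words(["make", "take"])
# Test pairs are (w1, w2) = (noun, verb), matching the example in the text.
rewritten = rewrite_test_set([("plans", "make"), ("action", "take")], mapping)
```

Each rewritten item carries the noun, the pseudo-word, and the verb that was actually observed, so a method's choice between the two members can be scored automatically without hand-labeled senses.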

3.2 Data 



We used a statistical part-of-speech tagger (Church,
1988) and pattern matching and concordancing tools
(due to David Yarowsky) to identify transitive main
verbs and head nouns of the corresponding direct
objects in 44 million words of 1988 Associated Press
newswire. We selected the noun-verb pairs for the
1000 most frequent nouns in the corpus. These pairs
are undoubtedly somewhat noisy given the errors
inherent in the part-of-speech tagging and pattern
matching.

name     range                                   base LM constraints                            tune?
$D$      $[0, \infty]$                           $P(w_2|w_1') \neq 0$ if $P(w_2|w_1) \neq 0$    yes
$A$      $[0, 2 \log 2]$                         none                                           yes
$L$      $[0, 2]$                                none                                           yes
$P_C$    $[0, \frac{1}{2} \max_{w_2} P(w_2)]$    Bayes consistency                              no

Table 1: Summary of similarity function properties


We used 80%, or 587833, of the pairs so derived
for building base bigram language models,
reserving 20% for testing purposes. As some, but
not all, of the similarity measures require smoothed
language models, we calculated both a Katz back-off
language model ($P = \hat{P}$ (equation (2)), with
$P_r(w_2|w_1) = P(w_2)$), and a maximum-likelihood
model ($P = P_{ML}$). Furthermore, we wished to investigate
Katz's claim that one can delete singletons,
word pairs that occur only once, from the training
set without affecting model performance (Katz,
1987); our training set contained 82407 singletons.
We therefore built four base language models,
summarized in Table 2.





         with singletons    no singletons
         (587833 pairs)     (505426 pairs)
MLE      MLE-1              MLE-o1
Katz     BO-1               BO-o1

Table 2: Base Language Models

Since we wished to test the effectiveness of using
similarity for unseen word cooccurrences, we
removed from the test set any verb-object pairs that
occurred in the training set; this resulted in 17152
unseen pairs (some occurred multiple times). The
unseen pairs were further divided into five
equal-sized parts, $T_1$ through $T_5$, which formed the basis
for fivefold cross-validation: in each of five runs, one
of the $T_i$ was used as a performance test set, with the
other 4 sets combined into one set used for tuning
parameters (if necessary) via a simple grid search.
Finally, test pseudo-words were created from pairs
of verbs with similar frequencies, so as to control for
word frequency in the decision task. We use error
rate as our performance metric, defined as

$$\frac{1}{N}\left(\#\text{ of incorrect choices} + \frac{\#\text{ of ties}}{2}\right)$$

where $N$ was the size of the test corpus. A tie occurs
when the two words making up a pseudo-word are
deemed equally likely.
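
The error-rate metric, with ties counted as half an error, can be sketched in Python as below. The input representation (a list of score pairs for the correct and the alternative word) and the scores themselves are assumptions made for this illustration.

```python
def error_rate(score_pairs):
    """Error rate = (# incorrect + # ties / 2) / N.

    Each item is (score of the correct word, score of the other word).
    """
    n = len(score_pairs)
    if n == 0:
        raise ValueError("empty test set")
    incorrect = sum(1 for good, bad in score_pairs if good < bad)
    ties = sum(1 for good, bad in score_pairs if good == bad)
    return (incorrect + ties / 2) / n

# Invented scores: two correct choices, one incorrect choice, one tie.
mixed = error_rate([(0.3, 0.1), (0.1, 0.3), (0.2, 0.2), (0.4, 0.1)])

# A method that scores every unseen pair 0 ties on every instance.
all_ties = error_rate([(0.0, 0.0)] * 4)
```

A method that cannot distinguish the two candidates on any instance therefore scores exactly 0.5, the same as random guessing in expectation.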



3.3 Baseline Experiments 

The performances of the four base language models
are shown in Table 3. MLE-1 and MLE-o1 both
have error rates of exactly .5 because the test sets
consist of unseen bigrams, which are all assigned a
probability of 0 by maximum-likelihood estimates,
and thus are all ties for this method. The back-off
models BO-1 and BO-o1 also perform similarly.

          $T_1$    $T_2$    $T_3$    $T_4$    $T_5$
MLE-1     .5       .5       .5       .5       .5
MLE-o1    .5       .5       .5       .5       .5
BO-1      0.517    0.520    0.512    0.513    0.516
BO-o1     0.517    0.520    0.512    0.513    0.516

Table 3: Base Language Model Error Rates

Since the back-off models consistently performed
worse than the MLE models, we chose to use only the
MLE models in our subsequent experiments. Therefore,
we only ran comparisons between the measures
that could utilize unsmoothed data, namely,
the $L_1$ norm, $L(w_1, w_1')$; the total divergence to the
average, $A(w_1, w_1')$; and the confusion probability,
$P_C(w_1'|w_1)$. In the full paper, we give detailed
examples showing the different neighborhoods induced
by the different measures, which we omit here for
reasons of space.

3.4 Performance of Similarity-Based 
Methods 

Figure 1 shows the results on the five test sets,
using MLE-1 as the base language model. The parameter
$\beta$ was always set to the optimal value for the
corresponding training set. RAND, which is shown
for comparison purposes, simply chooses the weights
$W(w_1, w_1')$ randomly. $S(w_1)$ was set equal to $V_1$ in
all cases.

The similarity-based methods consistently outperform
the MLE method (which, recall, always has
an error rate of .5) and Katz's back-off method
(which always had an error rate of about .51) by
a huge margin; therefore, we conclude that information
from other word pairs is very useful for unseen
pairs where unigram frequency is not informative.
The similarity-based methods also do much better
than RAND, which indicates that it is not enough
to simply combine information from other words
arbitrarily: it is quite important to take word similarity
into account. In all cases, $A$ edged out the
other methods. The average improvement in using
$A$ instead of $P_C$ is .0082; this difference is significant
to the .1 level ($p < .085$), according to the paired
t-test.

^3 It should be noted, however, that on BO-1 data,
KL divergence performed slightly better than the $L_1$ norm.



[Figure 1 plot: "Error Rates on Test Sets, Base Language Model MLE-1"; series RAND, CONF, L, and A.]

Figure 1: Error rates for each test set, where the
base language model was MLE-1. The methods, going
from left to right, are RAND, $P_C$, $L$, and $A$.
The performances shown are for settings of $\beta$ that
were optimal for the corresponding training set. $\beta$
ranged from 4.0 to 4.5 for $L$ and from 10 to 13 for
$A$.

The results for the MLE-o1 case are depicted in
Figure 2. Again, we see the similarity-based methods
achieving far lower error rates than the MLE,
back-off, and RAND methods, and again, $A$ always
performed the best. However, with singletons omitted
the difference between $A$ and $P_C$ is even greater,
the average difference being .024, which is significant
to the .01 level (paired t-test).

[Figure 2 plot: "Error Rates on Test Sets, Base Language Model MLE-o1"; series RAND, CONF, L, and A.]

Figure 2: Error rates for each test set, where the
base language model was MLE-o1. $\beta$ ranged from 6
to 11 for $L$ and from 21 to 22 for $A$.

An important observation is that all methods,
including RAND, were much more effective if singletons
were included in the base language model; thus,
in the case of unseen word pairs, Katz's claim that
singletons can be safely ignored in the back-off model
does not hold for similarity-based models.

4 Conclusions

Similarity-based language models provide an
appealing approach for dealing with data sparseness.
We have described and compared the performance
of four such models against two classical estimation
methods, the MLE method and Katz's back-off
scheme, on a pseudo-word disambiguation task. We
observed that the similarity-based methods perform
much better on unseen word pairs, with the measure
based on the KL divergence to the average being the
best overall.

We also investigated Katz's claim that one can 
discard singletons in the training data, resulting in 
a more compact language model, without significant 
loss of performance. Our results indicate that for 
similarity-based language modeling, singletons are 
quite important; their omission leads to significant 
degradation of performance. 